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Abstract. Exact-sparsity inducing prior distributions in Bayesian analysis typi¬ 
cally lead to posterior distributions that are very challenging to handle by standard 
Markov Chain Monte Carlo (MCMC) methods, particular in high-dimensional mod¬ 
els with large number of parameters. We propose a methodology to derive smooth 
approximations of such posterior distributions that are, in some cases, easier to 
handle by standard MCMC methods. The approximation is obtained from the 
forward-backward approximation of the Moreau-Yosida regularization of the neg¬ 
ative log-density. We show that the derived approximation is within 0 (^/ 7 ) of 
the true posterior distribution in the /3-metric, where 7 > 0 is a user-controlled 
parameter that defines the approximation. We illustrate the method with a high¬ 
dimensional linear regression model. 


1. Introduction 


Successful handling of statistical models with large number of parameters from 
limited data hinges on the ability to solve efficiently and simultaneously two prob¬ 
lems: (a) weeding out non-significant variables, and (b) estimating the effect of the 
significant variables. The concept of sparsity has come to play a fundamental role 
in this endeavor. In the Bayesian framework, sparsity is naturally built in the prior 
distribution using spike-and-slab priors (Mitchell and Beauchamp| (|1988 ); George and 


McCulloch (1997)), which are mixtures of a point mass at the origin (the spike) and a 


continuous density (the slab). We will refer to such priors as exact-sparsity inducing 
priors. A number of recent works have established that these priors, with carefully 
chosen slab densities, produce posterior distributions with optimal posterior contrac¬ 
tion rates (Castillo et al. (2015); Atchade (2015|>). However, the flip side of such stellar 
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statistical properties is the fact that these posterior distributions are computationally 
difficult to handle, particularly in high-dimensional applications. Deriving tractable 
and scalable approximations for such distributions is therefore a problem of practical 
importance. 

The most commonly used approach for dealing with posterior distributions from 
exact-sparsity inducing priors consists in integrating out the regression coefficients 


(George and McCulloch (1997); Bottolo and Richardson (2010)). Recent results by 


Yang et al. (J2015) have shown that such an approach indeed scales well with the 


dimension of the parameter space. However, it holds the limitation that it does not 
solve both the variable selection and sparse estimation problems jointly. Furthermore, 
it does not easily extend to non-Gaussian slab^J or to non-Gaussian models. Another 
solution to dealing with these posterior distributions is to design specialized MCMC 
samplers, typically using trans-dimensional MCMC techniques such as reversible jump 


MCMC (Chen et al. (2011)), or the newly proposed shrinkage-thresholding Metropolis 


adjusted Langevin algorithm (STMaLa; Schreck et al. (2013)). See also Ge et al. 


(2011) for a specialized sampler for blind-deconvolution models. However, these trans- 


dimensional MCMC samplers currently do not scale well to large problems (Schreck 


et al. (2013)). 


The discussion above suggests that when dealing with exact-sparsity inducing pri¬ 
ors in high-dimensional regression problems, scalable approximation of the posterior 


distribution would be useful. Notice that the Laplace approximation (Tierney and 


Kadane (1986)), one of the most standard approximation tool in Bayesian computa¬ 
tion, cannot be straightforwardly applied when the dimension of the space is as big 


as the sample size (Shun and McCullagh (1995)). Variational Bayes approximations 


recently explored by Ormerod et al. (2014) form a promising solution, but remained 


to be fully explored in high-dimensional settings. 

1.1. Main contribution. We propose a methodology to approximate posterior dis¬ 
tributions derived from exact-sparsity inducing priors. An interesting feature of the 
approximation is that the approximation error is easily controlled by the user. Fur¬ 
thermore, in several important cases, the approximation thus obtained is easily ex¬ 
plored by standard Markov Chain Monte Carlo (MCMC) algorithms. The approxi¬ 
mation is obtained by taking the forward-backward approximation (closely related to 
the Moreau-Yosida approximation) of the negative log-density. The Moreau-Yosida 
approximation is a well-established regularization method in optimization for dealing 


with non-smooth and constrained problems (Moreau (1965); Bauschke and Combettes 


Tor optimal posterior contraction rate, currently available results suggest that one needs slab 
densities with tails heavier than Gaussian. 
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(201l|). Several recent works have recognized the usefulness of the Moreau-Yosida 
regularization for Bayesian computation. Pereyra (2015) noted that a log-concave 
density can be well approximated by its Moreau-Yosida approximation. However, the 
framework developed by Pereyra (2015) does not handle the class of posterior dis¬ 
tributions considered here. Another related work is the STMaLa of ISchreck et all 
(2013) mentioned above, which implicitly uses the Moreau approximation to de¬ 
sign Metropolis-Hastings proposals to sample from posterior distributions with exact- 
sparsity inducing prior distributions. 

If Il(- 1 2;) denotes the posterior distribution of interest on x {0, l} rf given data 2, 
we write II 7 (-|a) to denote the proposed Moreau-Yosida approximation, where 7 > 0 
is a user-controlled parameter that defines the quality of the approximation. We 
derive a general result (Theorem [9]) that shows, under some regularity conditions, 
that 


d/3 (n(-|^),n 7 (-|^)) =0(y/ 7), 


( 1 ) 


where is the /3-metric that metricizes weak convergence (see Section 1.2 for precise 


definition). One challenge with using the proposed approximation is to find values of 
7 for which II 7 is close to n, but not too close so that Markov Chain Monte Carlo 
samplers with good mixing properties can be easily developed for If 7 . In Theorem[8]we 
propose a choice of 7 that strikes the aforementioned balance, as we show empirically 
in the simulation examples. Furthermore, with this particular choice of 7, we show 
that the constant in the big O in (jTJ) degrades with the dimension d at most linearly. 

We illustrate the method using a linear regression model with a spike-and-slab 
prior, where the slab is the elastic net density (Li and Lin (2010)). The example 
has relevance because the posterior distribution thus defined (actually a special case 
thereof) is known to contract at the optimal rate (Castillo et al. (2015)). Our proposed 
methodology produces an approximation II 7 of this posterior distribution, and we 
develop an efficient Markov Chain Monte Carlo algorithm to sample from II 7 . In this 
particular example, we show that the Moreau-Yosida approximation actually scales 
very well with the dimension. More precisely, we derive an upper bound similar to 
(jTj) that degrades at most like log(d) as d —> 00 (see Theorem 10). We illustrate 
these results in a simulation study which shows that the method performs well, and 
outperforms STMaLa for high-dimensional problems. A Matlab implementation can 
be obtained from http://dept.stat.lsa.umich.edU/~ yvesa/Research.html. 

The remainder of the paper is organized as follows. We close the introduction with 
some notation that will be used throughout the paper. In Section [2j we first introduce 
the class of posterior distributions of interest, followed in Section [3] by the basic idea 
of the Moreau-Yosida approximation. In Section [4j we develop how the idea can be 
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applied to approximate the posterior distributions of interest. Section [5] details an 
application to linear regression models. We close the paper with further discussion in 
Section [6j All the proofs are gathered in the Appendix, placed in a supplemental file. 


1.2. Notation. Throughout the paper, d > 1 is a given integer and denotes the 
d-dimensional Euclidean space equipped with its Borel sigma-algebra, its Euclidean 
norm || • ||, and inner product (■,■). We also use the norms ||0||i = f Y^j=i |#j|i and 
||0||o defined as the number of non-zero components of 9. The Lebesgue measure on 
]R rf is written as dx when there is no confusion. 

We set A = f {0, l} rf . For 5 G A, fi$ denote the product measure on defined 
as // 5 (d0) = n-=i us j (dOj ), where z/o(d z) is the Dirac mass at 0, and vi(dz) is the 
Lebesgue measure on R. Hence integration with respect to sets to zero all the 
components for which Sj = 0, and integrates the remaining components using the 
standard Lebesgue measure. 

For 6 R d , 9 ■ r d denotes the component-wise product: [9 • d)j = dj'dj, 1 < j < d. 
For 6 G A, we shall write 9s to denote 9-5, and we set 

R d 5 d = { 9 5 , 9 G M d } = {9 G R d : % = 0 for Sj = 0, j = 1,..., d}. 


We will need ways to evaluate the distance between two probability measures. Let 
(X, dx) be some arbitrary separable complete metric space equipped with its Borel 
sigma-algebra. For any two probability measures ^i ,//2 on X, the /3-distance between 
fi \, jJ /2 is defined as 

d/3(/Ji,H2) d = sup 
II/IIbl<i 


/ dx) - / /(x)// 2 (dx) 

/X Jx 


( 2 ) 


where the supremum is taken over all measurable functions / : X 
Ibl H/Hoo + Wf\\\_ < 1, where 


such that 


oo = f sup | /(x) |, 
xex 


and 


def 

L = sup 


|/(xi) - /(x 2 )| 
dx(^i,x 2 ) 


, Xi,x 2 G X, Xi / x 2 


It is well-known that this metric metricizes weak convergence (see e.g. Dudley 


(2002j) Theorem 11.3.3). If the supremum in ([2]) is replaced by a supremum over 
all measurable functions / : X —> R such that ||/||oo A 1 (resp. ||/||l < 1) one obtains 
the total variation metric d tv (resp. the Wasserstein metric d w ). 


2. High-dimensional posterior distributions with sparse priors 

Let z be a realization of some random variable Z with conditional distribution fg, 
given a parameter 9 G R d . With a prior distribution n on 9, the posterior distribution 
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for learning 9 is 


f[(d0|z) = 


fe(z) n(d0) 


j Rd fe(z)u(dey 

Although the prior distribution II can be constructed in a variety of ways, we focus 
on exact-sparsity inducing priors (spike-and-slab priors). Such prior distributions 
have been recently shown to produce posterior distributions with optimal contraction 
properties (Castillo et al. (2015); Atchade (2015)). More specifically, we consider a 


prior distribution II on A x 


■ of the form 

n(<j,d 0 ) = vr 5 n(d0|<5), 


for a discrete distribution {7r<5, 5 £ A} on A, and a prior n(-|d) that is built as follows. 
Given 5, the components of 9 are independent, and for 1 < j < d, 


0j\S 


if Sj = 0 
if Sj = 1 


( 3 ) 


Dirac(O) 

P(') 

where Dirac(O) is the Dirac measure on M with full mass at 0, and p(-) is a positive 
density on M. By the standard data-augmentation trick, we will take the variable 5 
as part of the posterior distribution. As defined, the support of II(-1<5) is = {9 £ 
: 0j = 0 for Sj = 0, 1 < j < d}, and II(-|<5) has a density with respect to the 
measure ps defined in Section |1.2[ 

B(d6»|A) = e _p ( 0 l 5 ) p s (d9), where 


P(9\S) d = 


-E 7 :5, = l l 0 gP(^) if6) e 


+oo 


otherwise 


In the above formula, and throughout the paper, we convene that e 00 = 0, and 
0 x oo = 0. We also define 


l{6) d = -log/e(z), and h(0\5) d = 1(6) + P(9\S), 9 £ R d , 
so that the posterior distribution writes 

n(<5, dd\z) oc TT S e- hm p 5 (d0). (4) 

Monte Carlo simulation from this posterior distribution can be challenging. The 
issue is related to the discrete-continuous mixture form of the spike-and-slab prior on 
9, which has the effect that any two distributions n(5, -| z) and n(d / , -| z) are mutually 
singular for 6^6'. As a result, if direct sampling from the conditional distribution 
of 9\S, z is not possible, then sampling from Q requires the use of trans-dimensional 


def 


MCMC methods such as reversible jump (Chen et al. (2011)), or STMaLa (Schreck 
et al. (2013)) which is shown to perform better than reversible jump. However 


one issue with STMaLa is that the algorithm has several tuning parameters that 
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are currently poorly understood. Furthermore, as we shall see in the simulations, 
the mixing of the algorithm degrades significantly for high-dimensional problems, 
particularly when the signal is weak. 


3. The Moreau-Yosida approximation 


Our goal in this work is to develop a more tractable approximation to the poste¬ 
rior distribution fl in (jd]) using the Moreau-Yosida approximation. However to make 
the ideas easy to follow, we start with some general discussion of the Moreau-Yosida 
approximation. Let h : —> (—00, +00] be a convex, lower semi-continuous func¬ 

tion that is not identically +00, and let g be a sigma-finite measure on M d . In the 
applications, n will naturally be taken as the Lebesgue measure on the domain of h 
(the domain of h is the set of points x E such that h(x) < 00). Assuming that 
Z = f f Rd e~ h ^n(dx) < 00, we consider the probability measure 

v(dx) = fi{dx). (5) 

Zj 

To fix the ideas, the reader may think of the case where h is finite everywhere and /x 
is the Lebesgue measure on M d . In that case v is the probability distribution on 
with density (1 /Z)e~ h ^ x \ However our main interest is in the posterior distribution 
Q for which the slightly more general setting is needed. 

Suppose that we are interested in drawing samples from u. The lack of smoothness 
of h, and the possibly complicated geometry of the support of v can create difficulties 
for standard MCMC algorithms. A smooth approximation of v can be formed from 
the Moreau-Yosida approximation of h defined for 7 > 0 as 

hry(x) = min 
«eR d 


h(u) H- 11 it — m || 2 

27 


x E 


Under the assumptions imposed on h above, the function h 7 is known to be well- 
defined and finite everywhere. It is also convex, continuously differentiable with a 
Lipschitz gradient, and h^{x) t h(x), as 7 0, for all x E M d . All these properties 

are well-known and can be found in Bauschke and Combettes ( 201 1| ) (Chapter 12). 
Assuming that Z 7 c = L d e~^^dx < 00, it seems natural to consider the probability 
measure 

u 7 (dx) = -Z e -^i( x )^ Xj 

Zry 

as an approximation of v. To the best of our knowledge, the approximation z > 7 was 
first considered by Pereyra (2015), for a probability distribution u for which h is finite 
everywhere and /x is the Lebesgue measure on W. d . And we refer the reader to that 
paper for a good discussion of the basic properties of u 7 , and how well it approximates 
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p. In particular Pereyra (2015) showed that the smoothness of hy can be exploited to 
derive efficient gradient-based MCMC samplers for p. An important limitation of the 
Moreau-Yosida approximation is that it is typically not available in closed form, and 
its computation leads to a d-dimensional, possibly complicated optimization problem. 

In many problems the function h takes the particular form 


h(x) = £(x) + P(x), x E 


where l is convex, finite everywhere and twice continuously differentiable, and P is 
convex, not identically +oo and lower semi-continuous. In such cases, one can approx¬ 
imate l around a given point x by its linear approximation u i-a £{x) + (V£(x),u — x), 
where V£(x) denote the gradient of £ at x. This approximation leads to the so-called 
forward-backward approximation of h, defined for 7 > 0 as 


, / \ def 

hy[x) = mm 


£(x) + {V£(x),u — x) + P(u) + — ||u — x || 2 

2 7 


, x E 


= £(x) H-- 1 | V€(a 7)|| 2 + min 

2 ue R d 


1 


P(u) + —llu — x + 'y\/£(x)\\ 2 
27 


( 6 ) 


Under the assumptions imposed on £ and P above, the function hy is finite everywhere, 


continuously differentiable, and hy < h. These properties can be found in Patrinos 


et al. (2014) Theorem 2.2, but are easy to derive. For instance, the differentiability 


follows from the expression ([ 6 ]) , the twice differentiability of £, and the differentiability 
of the Moreau-Yosida approximation of P. Notice however that hy is no longer convex 
in general. Assuming that Zy = f f Kd e^ h ' l<x hlx < 00 it seems also natural to consider 
the resulting approximation of v defined as 

Uy(dx) = -^-e~ hl ^dx. 

Zj ry 


The main advantage of hy over hy is that in many problems of interest hy is available 
in closed form, whereas hy is not. Furthermore, if the function P is separable, then 
the computation of hy leads to d separate one-dimensional optimization problems. 
However, the price to pay for the computational convenience is that hy may not be 
convex, and it is a less accurate approximation of h. Indeed, by the convexity of £, 
we have £{u) > £(x) + {V£(x),u — x) for all u E M d . Hence hy(x) < hy(x) < h{x) for 
all x E M rf . As we will see, the convergence hy(x) f h(x), as 7 | 0, for all x E M d , 
still holds. Figure [l] gives an illustrative example of the differences between hy and 
hy and how both functions approximate h. 

Since hy converges pointwise to h as 7 | 0, it seems natural to expect that Vy 
approaches v for small 7. If the function h is finite everywhere, one can easily show 
(see Proposition [I] below) that indeed, Vy converges to v in the total variation metric, 
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7=5 7=1 7 = 0.1 





FIGURE 1. Figure showing the function h(x) = —ax + log(l + e ax ) + b\x\ 
for a = 0.8, b = 0.5 (blue/solid line), and the approximations /i 7 and h 7 
(/i 7 < hj), for 7 £ {5,1,0.1}. For 7 = 0.1, the curves of h 7 and h 7 are 
almost undistinguishable on the figure. 

as 7 }, 0. However this result is no longer true when the domain of h has zero R rf - 
Lebesgue measure. In this latter case, we will show that the convergence of is 7 occurs 
only weakly, or in the Wasserstein metric. 

Proposition 1 . Suppose p in |5j) is the Lebesgue measure on R d , h = £ + P is 
convex, finite everywhere, and /i 7 (x) t h(x) for all x £ R rf . Suppose also that there 
exists 70 > 0 such that Z l0 < 00 . Then for all 7 £ ( 0 , 70 ], is well-defined, and 

dtv{v 7 , v) <2^1- I 0 , as 7 I 0 . 

Proof. See the Appendix. □ 

Remark 2. (1) Notice that Proposition [I] can also be applied to is, y by taking 

£ = ( 1 . 

(2) We show in Lemma 2 in the Appendix that if £ is finite everywhere and differ¬ 
entiable, and P is finite everywhere, convex with a nonempty subdifferential 
at x for all x £ R d , then h 7 t h, as required in the proposition. 

If the domain of h has R d -Lebesgue measure 0, then is and is 7 are then automat¬ 
ically mutually singular and Proposition [l] cannot hold. The following toy example 
illustrates this case. 

Example 3. Suppose that we take R^ = R, £ = 0, and we take P(x) = 0 is x = 0, 
and P{x) = +00 if x 7 ^ 0. In that case e~ h ^ = 1 if x = 0, and e~ h ^ = 0 if x 7 ^ 0. 
Let r = So be the point mass probability measure at 0. Hence is = 5 0 . For 7 > 0 , 
h 7 (x) = h 7 (x) = x 2 /( 2 7), x £ R. Hence is^ is the normal distribution N(0,7). It 
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follows that dtv(u 7 , v) = 2, for all 7 > 0. But for any Lipschitz function / : M —> M 
with Lipschitz constant 1, 




where Z 7 ~ N(0,7). By taking / = | • |, it can be easily seen that d w (p 7 , v) = y 
Hence u 7 converges in the Wasserstein metric to v, but not in total variation. And 
the convergence rate is 0{y/ 7 ). 

Remark 4. The fact that we only have convergence in the Wasserstein metric has 
practical implications. It implies that one needs to be cautious about what probability 
u(A) can be well approximated by 17 , (A). For instance, in Example [3j if A is of the 
form (a,0] or [0,6], then u(A) = 1, whereas lirm^o ^(A) = 0. 

In the next section we will use the approximating measure u 7 introduced above to 
approximate the posterior distribution We will see that the situation is similar 
to the one in Example [3j and as in that example we will show that the approximation 
converges weakly to the posterior distribution II. 

4. The Moreau-Yosida approximation of the posterior distribution Q 

In this section, we return to the posterior distribution Q defined in Section[2j And 
we make the following assumptions on the functions £ and P. 

HI. (1) The function 9 1 —> £{9) is finite everywhere, convex, and twice continu¬ 
ously differentiable. 

(2) For all 5 £ A, the function 9 1-7 P(9\5) is convex, lower semi-continuous, not 
identically + 00 , and admits a sub-gradient g{9\8) at 9, for all 9 £ M^. 

Remark 5. (1) The convexity assumption on £ is fundamental and delineates 

the type of problems to which the proposed approximation could be easily 
applied. Extension beyond this set up is possible, but will require fundamen¬ 
tally different techniques. 

(2) The convexity of P(-\8) boils down to the log-concavity of the density p in the 
prior Q. Most of the sparsity promoting prior densities used in practice are 
log-concave. 

Given 5 £ A, we consider the forward-backward approximation of h(-\5) defined as 


h~,(9\5) = min £(9) + (Vi(9). u - 9) + P(u\5) + — ||u - 9\\ 2 , 9 £ R d , (7) 
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for some parameter 7 > 0. Using h 7 , we propose to approximate the posterior 
distribution II in Mb by 


n 7 (< 5 , d0\z) oc tts ( 2717 )^ e~ h ^ m d9, 


( 8 ) 


that we call the Moreau-Yosida approximation of II, although as we have seen above, 
([7]) is only the forward-backward approximation of h(-\5). In the expression (J8|, 
7 r denotes the irrational number. The function h 7 (- |<5) is available in closed form 
whenever the Moreau-Yosida approximation of P(- |<5) has a closed form expression. 
More specifically, for 5 £ A, and for 7 > 0, we define 


P 1 (9\5) = f min 


PH5) + T||u-9f 
2 7 


(9) 


the Moreau-Yosida approximation of P, and its associated proximal map 


def 


Prox 7 (#|<5) = Argmin ueR a 


pm + Pu-ef 

27 


9 £ 


From the definition of P 7 and Prox 7 , we see that h 7 can be alternatively written as 


where 


h,m = £(9) — ^||VT((9 )|| 2 + P 7 (0 — 7 W(0)|<5) 

= i{9) + (V09), J 7 (0|5) - 9) + P(J 7 (0 \5)\S) 


J 1 {9\ 5) d = Prox 7 (9 - 7 V£(0)|<5). 


( 10 ) 

(ID 


For 7 > 0, 9 G R d , let s 7 (0) £ be such that 


def 


(s~/(9)) 3 = Argmin 

uSK 


1 


-logp(u) + — {u-OjY 


1 <j<d- 


Then it is easy to check that Prox 7 (@|<5) = S ■ s 7 (9). Hence by Equation , we 
see that h 7 (-\5) is computationally tractable if the map s 7 (the proximal map of the 
negative log-prior) is easy to compute. Although this limits the applicability of the 
method, there a several priors commonly used for which this holds, including the 
Laplace prior and the elastic-net prior given respectively by 

p(u ) oc e -A ^, and p(u) oc exp aAi|rt| — (1 — a)\ 2 —^ , 

as well as the generalized double Pareto of Armagan et al. (2013), and the (improper) 
prior distribution that arises from the MCP of Zhang (2010), given respectively by 


1 


P(u) = — 1 + 


2A 


\u\ 


aX 


-(“+!) 


and p(u) = exp —A 


r-M 


1 - ~~r 1 d t 

aX. 
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4.1. Connection with spike-and-slab priors. The proposed approximation II 7 is 
closely related to the distribution obtained by replacing all the Dirac mass in ([3]) by 
independent Gaussian distributions N(0, 7 ), 7 > 0. More precisely, let IT 7 denote the 
posterior distribution of (5, 9) in the following model. 


<5~{7T5}, 6j\5 


N(0, 7 ) if Sj = 0 

P(-) if 6 i = 1 


, 1 < j < d, and Z\8,0 ~ fg s 


( 12 ) 


Notice that in (12) given (5, 9), we draw Z from fg s , with a sparse parameter 9s- The 


posterior distribution thus defined is 


jrMi 

n 7 (J,d0|z) ocvr^ ( 2 e -27 ||0 - 0 dl 2 e -M^|5)d0. 


(13) 


The distribution II 7 in turn, is closely related to another posterior distribution com¬ 
monly used in practice and obtained from the following model: 


8 ~ {tea}, 9j\S 


N(0, 7 ) if Sj = 0 
P(-) 


if Sj = 1 


, 1 < j < d, and Z\6, 9 ~ f e , (14) 


for some constant 7 > 0. Here given (8,9), we draw Z from fg. Model (14) is 


widely used in practice as a more tractable alternative to the point-mass spike-and- 


slab (jGeorge and McCulloch 

(1997); 

Ishwaran and Rao 

(2005) 

Rockova and George 

(2014); 

Narisetty and He 

(2014)). Clearly, to the extend that the point-mass spike- 


and-slab is the gold-standard, the model in (12) is preferable to the one in (14). 


The Moreau-Yosida approximation proposed in this paper can be viewed as a very 
close approximation to n 7 , as we show that dt v (n 7 ,n 7 ) = 0( 7 ) (see Lemma 3 in 


the Appendix, and (20)), where d tv denotes the total variation metric. The interest 
of our method then comes from the fact that sampling from fl 7 is much easier than 
sampling from n 7 . Indeed, notice that in n 7 , the parameter 5 appears also in the 
likelihood function £(9s)- As a result, both conditional distributions 8\z,9 and 9\z,8 
are typically intractable and require MCMC algorithms. Whereas in II 7 , given (z, 9), 
the components of 8 are independent Bernoulli random variables. 


4.2. Approximation bounds. We will now derive a result that bounds the (3- 
distance between fL, and IT We define 


6l (z) =Hog / e r ^ e) fl 1 (d8,d9\z), 


(15) 


Try(8,9) (V£(9) - V£(0g), 0 - 9s)) + ^\\8- W(0) + 8 ■ g (9 S \8) 


where 
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For simplicity, we shall omit the dependence of r 7 (5,9) on z (same with 1(9) and 
V£(9)). We note that by the convexity of £, r 7 (< 5, 9) > 0. Hence g 7 (z) > 0. 

Theorem 6 . Assume m for some fixed data z. Suppose that there exists 70 > 0 
such that n 70 (-|z) is well-defined. Then for all 7 € ( 0 , 70 ], n 7 (-|z) is well-defined and 

dp (lL r (-|z),n(-|*)) < V7^ + 2(l-e"^ (2) ) • (16) 

Proof. See the Appendix. □ 


Notice that 1 — e~ x < x for all x > 0. Therefore, the convergence to zero of 
(ll 7 (-|z), n(-|z)) would follow if the term g 7 (z) converges to 0 as 7 -» 0. In 
the next result we impose some additional assumptions, which, together with m 
guarantee that g 7 (z) converges to zero. In the process we derive an explicit bound 
on the convergence rate which can be used to develop guidelines for choosing 7 . We 
make the following assumption. 


H2. (1) There exists L\ < 00 such that, 

||W(0i) - W(0 2 )|| < ii||0i - 021|, 01, 02 € M d . (17) 

(2) There exists L 2 < 00 such that 

\\5-V£(9)\\ 2 <2L 2 £(6), 5 e A, 6 eR d 5 . (18) 

(3) For all 5 6 A, there exists c(5) < 00 , such that 

\\5-g(0\6)\\ 2 <c(5) + 2L 2 P(6\6), 9 e R d s . (19) 

Remark 7. Hj2^(l) is a standard Lipschitz assumption. 1^2]- (2) and 1^2]- (3) essentially 
requires both functions £ and P(-\5) to grow like O(||0|| 2 ), or o(||#|| 2 ), as ||0|| —> 00 . 


Theorem 8 . Assume mm for some fixed data z, and suppose 7 > 0 is such that 
47 max(Li, L 2 ) < 1. Then IT 7 (• |^) is well-defined and 

dp (n 7 (-| 2 ),n(-| 2 )) < y^d + 2 (1 - e-*«), 


where 


g 7 (z) < 37 


- maxc(d) + d(L\ + 2 L 2 ) + L 2 1Z(z) 

Z SE A 


where TZ(z) = maxjgA inf 0gK d [£(0s) + P(0s|(5)] < max^l^O) + P(0|5)]. 


( 20 ) 


Proof. See the Appendix. 


□ 
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Since 1 — e~ x < x, for all x > 0, Theorem [ 8 ] shows that as 7 —> 0, the Moreau- 
Yosida approximation fl 7 (-|z) approaches n(-|z) at the rate of y/ 7 , under tjljand I^2j 
The rate is optimal as Example [3] shows. Theorem [ 8 ] also provides some guidelines 
for choosing 7 , as it suggests that one can choose 7 as 


7 = 


7o 


max(Li, L 2 ) ’ 


with 


0 < 70 < -j-- 


( 21 ) 


The bound in (20) seems to suggest that the quality of the approximation resulting 
from choosing 7 as in ( 21 ) degrades only linearly with the dimension d, as d increase^} 


However, it is important to realize that the bound in (20) is most likely not tight, and 


the dependence of g y (z) on d could be even better than 0(d) (see the linear regression 
example below). In general we cautious against the use of too small values of 7 , since 
choosing II 7 very close to II limits the ability to construct good MCMC sampler to 
explore 1 I 7 . 


5. Application to Bayesian linear regression with sparse priors 


As an application we consider a high-dimensional linear regression problem, with 
dependent variable z E M n , and design matrix X E M nxd . The variance term a 2 * is 
assumed known. The negative-log-likelihood function f. for this problem can be taken 
as 

e(9) = ±\\z-xe\\*, 6eR d . 

za~ 

We will set up the prior distribution of 8 using 5 E A, and using an auxiliary variable 
4> = (q, Ai, A 2 ), where q E (0,1) is a sparsity parameter, and Ai > 0, A 2 > 0 are regu¬ 
larization parameters. Given (j), we assume that the components of 5 are independent 
and identically distributed, with distribution Ber(q). Hence its = qll 5 ll° (1 — q) d_ Pllo. 
Given (j) and 6, the components of 8 are independent, and for 1 < j < d. 


6j\6, 0 ~ 


Dirac(0) if 5j = 0 

EN (sL !*) if A = 1 


where Dirac(0) is the Dirac measure on M with full mass at 0, and EN(Ai/cr 2 , A 2 /U 2 ) 
is the (elastic-net) distribution with density given by 


m exp 



(1 - a) 



lEl, 


( 22 ) 


2 Indeed, the term TZ(z) always satisfies 7 Z(z) < maxj[t(0) + P(0|<5)], and this latter expression 

typically does not grow with d. 
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for a parameter a E [0,1], assumed known. The normalizing constant Z(cj>) can be 
written as 


m 


G\i n 2 T, -erfcx f— , f Xl ^ 
V P“ Q ) A 2 \ v o- A /2(l-a)A 2 y 

2cr 2 

Ai 


if a E [0,1) 

if a = 1 


where erfcx(x) is the scaled complementary error function, which can be written as 
erfcx(x) = 2e a;2< f > (—\/2x), where <h is the cdf of standard normal distribution. The 
prior density (22) is a reparametrization of the elastic-net (Zou and Hastie (2005)) 
prior used by Li and Lin (2010). Notice that a = 1 makes A 2 inactive, and setting 
a = 0 makes Ai inactive. 

Given the prior specified above, the function P becomes 


pm = ||i||,logZ(« + + (1 J 2 )A2 Hfel| 2 + t»;(0), e € R J , 

where t K d( 0 ) = 0 if 9 E M^, and i K d (9) = +00 otherwise. We recall that = {9 E 
: 9j = 0, if Sj = 0, j = 1 With h(9\S) = £(9) + P(0|5), the posterior 

distribution of (S, 9) is 

LI (<5, d9\z) oc 7 Tse- h ( e \ s )ii$(d9). 

With the elastic net prior ( |22[ ), the proximal function Prox 7 (0|<5) is easy to compute. 
For x E M, dehne sign(x) as 1 is x > 0, —1 if x < 0 and 0 if x = 0. For 7 > 0, let 
Sy( 9 ) E denotes the vector whose j-th component is given by 


(s y(9))j 


sign (9j) (|0j| -e*7^) + 
1+7^(1-a) 


(23) 


It is easy to show that 

Prox 7 (0|<5) = S ■ s 7 (9), 

where 9\ • 02 denotes the component-wise product. From Q, it follows that the 
Moreau-Yosida approximation II 7 of II has a density 7 r 7 given by 


7f 7 ( 6 , 9 \z) oc irg (2-717) 2 e 


(24) 


where h 7 (-\S) is given by ( | 11 [ ). In the next result, we show that m and 1 ^ 2 ] hold for 
this problem, and Theorem [ 8 ] applies. For a matrix A, let A max (^4) denote its largest 
eigenvalue. 


Corollary 9. Suppose that (1 


ct)A 2 < X max (X'X), and suppose that 7 > 0 satisfies 
^\ max (X'X) < 1. (25) 
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Then for all z £ M n , n 7 (-|,z) is a well-defined, probability measure on A x M d , and 

(n 7 (-| 2 ),n(-| 2 )) <^d + 2 (1 - e-^w), 

where g^(z) satisfies 

Q-fiz) < d + ~^X max {X'X) ^3d + 


2 cj 2 


Proof. See the Appendix. 


As in the general case above, (25) suggests choosing 

r 2 


7 = 


70<T 


70 G (0,1/4], 


(26) 

□ 

(27) 


A max (X'X) ; 

And with this choice, the bound in (|26|) deteriorates only linearly in d. In fact, (26) 


is a worst case analysis and better bounds can be derived if one takes into account 
the sampling distribution of the data z. We prove one such result below. 

We shall take the frequentist viewpoint and assume that the observed data z is a 
realization of Z where 


Z = X6+ + e, 


where 


N( 0 , <J 2 I n )i 


(28) 


for a sparse unknown vector 0* £ R rf . In the Bayesian as in the frequentist framework, 
the recovery of 6+ when d > n depends on the positiveness of some restricted and 
sparse eigenvalues of X'X. We define these quantities next. Let <5* £ A, denote the 

def 

sparsity structure of 0*. That is, <5*j = 1 if and only if |0*j| > 0. We set s* = ||#*||o> 
the number of non-zero components of #*. We define 

def. .(d'(x'x)e 


k = inf 


n 


■ 9 / 0, \\6 — 0«sj|i < 7||0$J|i > , 


and s £ { 1 ,..., d}, we define 


-/ \ def 
K{S) = SUp 


6'(X'X)0 


1 < 


n 


^ r 0 ^ 


< s 


Finally, we define 


r def , 

c = < ^ £ 


: max I (Xk, z — X6f) I < Ai/2 

l<k<d 


Theorem 10. Assume (28), with a design matrix X that satisfies n > 0. Choose 
a = 1 , Ai = 4cr^nK(l) log(d), and ns = q^^°(l — where q = d~ u , for some 

constant u > 1. Suppose that 7 > 0 is small enough so that 

—nX max (X'X) < 1, and 24n«(l)7 < a 2 (u — 1). 
a z 
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Then if d > 2, P(Z € £) > 1 — 2/d, and for Z E £, 

E [ Ql {Z)\Z G e\ < - log (l - 0 + 41og(4) + s* log (l + 

+ us* log(d) + «(1) log(d) + ^ ( n\ max (X'X ) + Tr(X'X)) . (29) 

Proof. See the Appendix in the Supplement. □ 


Remark 11. To give some context, the theorem considers the posterior distribution 
n(-|Z), with a = l (Laplace prior), and Ai = 4<jy/nR(l) log(d) in (22), and with 
its = qll <5 ll°(l — q) rf— Iloilo^ where q = d~ u , for some constant u > 1. It has recently 
been shown by Castillo et al. (2015) that with the above choices, the 0-marginal 
of this posterior distribution contracts to a point-mass at 0 * at the optimal rate 
0(yG* log(d)/n). Theorem 10 gives a bound on the average error of the Moreau- 


Yosida approximation of this posterior distribution via 

E [d*? (RyOlZ), n(-|Z)) \Z G £] < y^d + 2 E [Qj(Z)\Z e £\, 


for Z G 5. 


The right-side of (29) does not converge to zero as 7 -A 0. However, it provides 
some useful insights. We see that if we choose 7 > 0 as in ( |27| ), then the right- 
side of (29) grows with d at most like log(d) + Tr(Y / Y)/A max (Y / Y). From random 


matrix theory it is known for several classes of random matrices that for fixed n, 
Tr(Y , X)/A max (Y , X) is typically 0(1) as d —> 00 . An inspection of the proof of 


Theorem [10] suggests that the dependence of the right-side of (29) on the sample size 
n can be improved. We leave this for possible future work. 


5.1. Dealing with the hyper-parameter (f>. We use a fully Bayesian for selecting 
the hyper-parameter <f> = (q,Ai,A 2 ). We assume independent priors such that q ~ 
Beta(l ,d u ) for some constant u > 1, Ai ~ U(a, M), and A 2 ~ U(a,M) for some 
small positive constant a (we use a = 10 - ' 5 in the simulations), and for a large positive 
constant M such that (1 — a)M < A max (X / Y). If 7 > 0 is such that (25) holds then 
the /3-distance between the resulting posterior distribution and its Moreau-Yosida 
approximation satisfies the same bound as in Theorem |9j 


5.2. Markov Chain Monte Carlo. The density 7 f 7 in (24) is a “standard” den¬ 


sity, and various MCMC schemes can be used to sample from it. We propose a 
Metropolized-Gibbs strategy. 
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5.2.1. Updating 5. Given 8 and 0, it is easy to see that hy{9\5) depends on 5j only 
through the expression 


(W(0)) i d i + logZ(0) + 


a\i\dj\ + 0.5(1 — a)X-2d‘j dj — 28jd 




2 7 


where dj is the j-tli component of s 1 {8 — 7 V£($); X\/a 2 , X 2 /C 1 2 ). Hence, we update 
jointly and independently the 5j by setting Sj = 1 with probability e r /(l + e r ), where 

1 


r = log — 1 -1- ^ log( 2 vT 7 ) 

1 — q 2 


(V^(0)) J -d i + logZ(^) + 


aAi|cZj| + 0.5(1 — a)X2 dj dj — 28jd 


3 3 


a- 


2 7 


5.2.2. Updating 8. Given 5 and q !>, we update the components of 8 using a mix of 
an independence Metropolis sampler, and a Metropolis Adjusted Langevin algorithm 
(MaLa). The MaLa strategy needs some motivation. Although its definition might 
perhaps suggest otherwise, the function _P 7 in ([9]) is actually differential (Bauschke 
(2011) Proposition 12.29) and for all 8,H £ M. d , 


and Combettes 


V e PJd\5) ■ H = — (9 — Prox~(8\5),H). 

7 


And since l is twice continuously differentiable in this example, the expression (10) 
shows that /i 7 is in fact differential and for all 8, H £ M d , 


• H 


To avoid dealing with second order derivatives, and since 7 is typically small, we make 
the approximation I d — r y'V^£(9) « Id, and therefore, we approximate Vghy(9\5) by 

c 


Voxels) ■ H = (X7£(9),H) -7{Vl(8)M 2 h(8) -H 

+ (^{8- 7 V^(0) - J 1 (8\5, </>)), (l d - 7 V^(0)) 

= -(e- J 7 (0|<5,0),(/ d - 7 v( 2 V(0)) • 


G 7 (0|5) d ^ f l -{8- M8\6)), and Gy{8\5) = 


0^815), (30) 


for a positive constant c. The function G 7 is introduced for further stability, in 


the spirit of the truncated Metropolis adjusted Langevin algorithm (see e.g. Atchade 


(2006)). Hence, given 5 and </>, one can update the components of 8 using a Metropolized- 
Langevin-type algorithm where the drift function is given by the corresponding com¬ 
ponents of Gy. This algorithm is similar to the proximal MaLa of Pereyra (2015). 

However, when 5j = 0, the corresponding component of Gy(6\S) is dj /7 and is 
typically very large and not very informative (particularly for 7 small). To deal with 
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this, we use the following strategy. We update jointly the components 6j for which 
5j = 1 using the MaLa algorithm outlined above. Then, we group together all the 
components for which Sj = 0 and we update them jointly using an independence 
Metropolis sampler. The proposal density of the Independence Metropolis sampler 
is built by approximating Jy(6\5) by Prox 7 (0|d). This approximation makes sense 
because, for 7 ~ 0, Jj(6\S) = Prox 7 (0 — 'yXl(0)\S) « Prox 7 (0|d). 

To explain the detail of the independence sampler, let 6$ = ( Oj , j : s.t. Sj = 1), 
u = (Oj, j : s.t. Sj = 0), and let us represent 6 by the pair ( 6$,u ). Let h^(0s,u\S) be 
the function obtained by replacing J-y(0$,u\S) by Prox 7 (0,5, u|5) in the expression of 
h~y(0$,u\5). Because, Prox 7 (0£, u|d) does not actually depend on u , we have 

h^(0s, u|<5) = £(0) + (V£(0), Prox 7 ( 05 , u|5) — 0) + —1| Prox 7 (0,5, u|d) — 6\\ 2 + const. 

1 2 

2 2 ||^ XfiOjj 

- k (z — Xs0$ — Xgcu, XCProx~(6s, u|<5) — 5 ■ 9) + Xgcu) 

a z 

1 „ „ 2 

H- \\u\\ + const. 

27 

It is then easy to see that u 1-7 i s proportional to the density of the 

Gaussian distribution 

N ^EX' S cX (Prox 7 (0|d) - 5 • 6) , 7 s) , where S d = (l|| 5 c|| - ^X' Sc X S c ) _1 , 

where 5 C is the vector 1 — 6, and for any S € A, X$ G M nx ll <5 ll denote the sub-matrix 

of X obtained by selecting the columns for which Sj = 1. Notice that under the 

2 

assumption 7 < ^— \x'X) i the ma t r ix S is always positive definite. The acceptance 
probability of this independence sampler is 


min 


exp yh.y( 0 t 5 , u’\8) 
exp (h~/(0s, u\S ) 



We found this independence sampler to be extremely efficient, with an acceptance 
probability typically above 90%. 


5.2.3. Updating 4> = (< 7 , Ai, A 2 )- We update q ~ Beta( ||<5|| 1 + l,d + d u — ||<5||i), and 
we update (Ai,A 2 ) jointly using a Random Walk Metropolis algorithm with Gauss¬ 
ian proposal. For improved mixing, we adaptively tune the scale parameter of the 
proposal density. 
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5.3. Simulation results and comparison with STMaLa. We illustrate the method 
with a simulated data example. All the computations in this example were done using 
Matlab 7.14 on a 2.8 GHz Quad-Core Mac Pro with 24 GB of 1066 DDR3 Ram. 

We set n = 200, p = 500 and we generate the design matrix X by simulating the 
rows of X independently from a Gaussian distribution with correlation p^~^ between 
components i and j. We set p = 0.9. Using X, we general the outcome z = X6+ + ere, 
with <7 = 1 that we assume known. We build 6 * by randomly selecting 10 components 
that we fill with draws from the uniform distribution eU(v/2,3v/2), where e = ±1 
with probability 1/2, all other components being set to zero. We consider two cases 
for v: v = 1 (that we refer to below as SCENARIO 1), and v = i^log (d)/n ~ 0.18 
(SCENARIO 2). SCENARIO 2 is obviously more challenging since the average strength 
of the signal is at the limit of what is detectable. 

We set 7 = ^a 2 /\ max (X'X) as prescribed by (27) with two choices of 70 : 70 = 0.25, 
and 70 = 0 . 01 . 

We compare these two samplers to the STMaLa sampler of Schreck et al. (2013). 
The comparison is slightly tricky because STMaLa uses a different prior, namely a 
Gaussian “slab” prior. However, we expect both posterior distribution on (<5, 6) to be 
close, and we expect (5*,0*) to be close to the center of both distributions. For the 
STMaLa, we use the Matlab code provided online by the authors, with the default 
setting. Unlike our approach, this sampler requires the true value of the sparsity 
parameter q, which we provide. We also edit their code to return the summary 
statistics presented below. 

We evaluate the mixing of these samplers by computing the following two metrics 
along the MCMC iterations: the relative error and the T-score (to evaluate structure 
recovery), defined respectively as 


£« = 


|| 0 (fc) - 0 * 
110*11 


and X (k) = 


2 x SEN( fc) PREC(£;) 
SEN (fc ) + PREC(fc) ’ 


where 


SEN^ 



, PREC (k) 


^-d = 11 {|ef ) |>o} 


(31) 


In stationarity we expect values of 8^ (resp. X^) to be close to zero (resp. one). 
In the absence of a better metric, we will graphically access the mixing time of the 
samplers by looking at how quickly the sequence 8^ (resp. X^) converges towards 
zero (resp. one). In order to account for the computing time, and for better compar¬ 
ison, we plot these metrics, not as function of the iterations k, but as function of the 
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gamma_0=0.25 

gamma_0=0.01 

stMaLa 


20 


I 

40 


I 

60 


80 


Time (in seconds) 



Figure 2. Relative error and structure recovery as function of time 
in SCENARIO 1. Based on 30 MCMC replications. The curves are 
sub-sampled to improve the readability of the figure. 


computing time needed to reach iteration k. For further stability in the comparison, 
we repeat all the samplers 30 times and average the two metrics and the computing 
times over these 30 replications. 

All the chains are initialized by setting all components of 9 (and <5^) to zero. 
We run the samplers for a number of iterations that depends on 0*. In SCENARIO 
1, we run the newly proposed sampler for 10,000, and we run STMaLa for 120,000 
iterations. In SCENARIO 2, we run our proposed sampler for 40,000, and we run 
STMaLa for 250,000 iterations. 

Figure [2] and [3] present the results. First, we observe that that 70 = 0.25 mixes 
significantly better than 70 = 0.01. We notice also that fl 7 approximates (#*,£*) 
only slightly better when 70 = 0.01 compared to 70 = 0.25. Overall, we found that 
70 G (0.1,0.25) produces a very good approximation. 

We also look at the usual sample path mixing of the proposed sampler by plotting 
the trace plot, histogram, and the autocorrelation plot from a single run of the sampler 
(Figure [4]). Here, we consider only SCENARIO 1, and we set 70 = 0.25. We look at 
the MCMC output {6j k \ k > 0 }, for one component j for which Sj = 0 , and for one 
component j for which 6j = 1. From this sample path perspective, the plots suggest 
that the proposed MCMC sampler has a good mixing. 

5.4. Empirical Bayes implementation and further experimentation. A lim¬ 
itation of the methodology is that a 2 is assumed known, which is rarely the case 
in practice. We explore by simulation an empirical Bayes solution whereby a 2 is 
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Figure 3. Relative error and structure recovery as function of time 
in SCENARIO 2. Based on 30 MCMC replications. The curves are 
sub-sampled to improve the readability of the figure. 
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Figure 4. Trace plot, histogram, and autocorrelation plot, from one 
MCMC run, using 70 = 0.25. Top row is for a component j for which 
the true value of 5j is 0. Bottom row, true value of 5j is 1. 
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estimated from data. Following Reid et al. (2013) we estimate a 2 by 


~ 2 _ 


1 


n-s Xn 


2=1 


5^ (y* - X S\, 


where (3\ is the lasso estimate at regularization level A, and A n is selected by 10 - 
fold cross-validation, and where s Xn is the number of non-zeros components of f3 Xn . 
In the cross-validation, we choose A n as the value of A that minimizes the MSE. 
This leads to the empirical Bayes Moreau-Yosida posterior approximation that we 
denote II 7 (-|z, <r^). We do a simulation study using a semi-real dataset to compare 
the distributions fl 7 (-| z,a 2 ) and Fl 7 (-|z) (with the true value of a 2 set to one). We 
use the colon dataset (Buhlmann and Mandozzi ( |2014 )) downloaded from 
http: //stat. ethz. ch/~dettling/bagboost. html. The data gives microarray gene 
expression levels for 2, 000 genes for n = 62 patients in a colon cancer study. We 
randomly select a subset of p = 1,000 variables to form a design matrix X E ]j 62 xi,ooo. 


Following Buhlmann and Mandozzi (2014), we normalize each column of X to have 
mean zero and variance unity. We simulate a sparse signal vector 0* £ with 
s = 5 non-zeros components, and where the non-zeros components are drawn from 
U(—v — 1, — v) U (v, v + 1). We consider two scenarios: v = 1 and v = 3. Using X and 
0*, we generate z = X0 * + ae, with a = 1, and e ~ N(0, I n ). 


We set 7 as in (27) with 70 = 0.25. We evaluate the samplers along the same 


metrics 8 and T. We average the results over 30 replication^] of the samplers, where 
each sampler is run for 50,000 iterations. The result is presented on Table [l] We 
notice that the recoery of 0* is poor in both cases when v = 1. When the signal 
is strong (v = 3), the empirical Bayes posterior distribution performs well, but as 
expected, under-performs the posterior distribution with known variance. 



Weak signal (v = 1) 

Strong signal (v = 3) 


EB 

True a 

EB 

True a 

Relative error (in %) 

97.3 

91.7 

12.4 

9.4 

F-score ( in %) 

14.5 

25.1 

79.6 

88.5 


TABLE 1. Table showing the posterior estimates (N — l?) -1 J 2 k = B + \ 8 ^ k \ 
and (N — B)~ x J 2 k= B +i averaged over 30 MCMC replications, each 
MCMC run is 5 x 10 4 iterations. 


^here only X and 0* are kept fixed. For each replication, the dataset z is re-simulated, and is 
re-estimated. 
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6 . Further Discussion 

In this work we have developed and analyzed a smooth approximation to high¬ 
dimensional posterior distribution using the Moreau-Yosida envelop. The method¬ 
ology can be readily extended to other high-dimensional statistical models (linear 
and generalized linear regression models, graphical models, sparse PCA, and others). 
Several theoretical issues remain. We have discussed some of these issues above. One 
important problem that we did not directly address concerns the mixing properties 
of the proposed MCMC algorithms, and the trade-off inherent to the methodology 
between good approximation properties of fl 7 , and good mixing of gradient-based 
MCMC simulation from fl 7 . Another potentially interesting direction of research 
is the idea of treating 1 I 7 itself as a quasi-posterior distribution, and investigating 
directly its posterior contraction properties. 


7. APPENDIX: Proof of the main results 

For convenience, we introduce the product space 0 = A x that we implicitly 
equip with the metric d§(0i, 0 2 ) = f \/||5i - A 2 ||§ + ||#i - foil 2 , % = 3 = 1 , 2 . 


7.1. Proof of Proposition [lj For all x £ R d , and 7 £ ( 0 , 70 ], e h ^ < e < 
e ~ h i 0 { x ) _ Hence Z < Z 7 < Z 70 < 00 . Since ji is the Lebesgue measure on M rf , we shall 
write it as dx. For any bounded measurable function / : —> M, we have 


n 7 (/)-n(/)| < 



dx 


< [ fe-^W - dx 

■^7 JR d ' ' 

- 2||/|| “ 0 - i) • 

The fact that Z 7 —> Z as 7 | 0, follows from Lebesgue’s monotone convergence applied 
to e~ h ^o — e~ h ~<. 


7.2. Proof of Theorem [6| We work on the product space 0 = A x introduced 
above. Throughout the proof, we assume that z is fixed, and at times we write fl(-|z) 
simply as IT Same for IT 7 (-1^;) and fl 7 (-|z). 

We prove the theorem in two steps. First in Lemma 12, we bound the Wasserstein 
distance between the distributions II 7 and II by showing that for all 7 > 0, 


d w (fl 7 ,n) < y^/d. 


(32) 
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Then in Lemma 


14 


we bound the total variation distance between II 7 and II 7 by 
showing that for all 7 € ( 0 , 70 ], 

d tv (n 7 , n 7 ) < 2 (l - e"^ (2) ) . (33) 

It is clear from their definitions that both the Wasserstein metric and the total vari¬ 
ation metric are upper bounds for the metrix (3, and the Theorem [ 6 ] follows by com¬ 
bining (32) and (33). The proof of Lemma 14 relies on a comparison result between 
the functions h and /i 7 established in Lemma 13 that is also of independent interest. 


Lemma 12. Let II 7 be the probability measure defined in (13). For all 7 > 0, 

\j^\fld ^1 - ^IE(IMIo)^ < d w { ft 7 ,n) < y/dry^l- ^IE(IMIo), (34) 

where 7 is a random variable on A with distribution given by the 5-marginal of II, 
that is P (7 = 5) (x its f Rd e~ h ^ 9 ^ p.s{d9), 6 € A. 


Proof. For 5 £ A, we set 

C{5) = f [ e-WMfisid0), and C = V 7 t s C{8). 


(35) 


<5eA 

Using this notation, we can write 

fl(S,de\z) = 7 ^^U(d9\6,z), where n(d0|£,z) d = /ifidO). 

G G(d) 

For 7 > 0, we notice that the normalizing constant of II 7 is 


C d ^ f 


1 

27T7 


e -^\\e-e s \\ 2 e -h(e s \5) d9 


^vr 5 / e h( ~ e \ 5) ns(dd), 


(36) 


which is the same as the normalizing constant of the posterior distribution II. Hence 
we get the following factorization of n 7 , 


n 7 (5,d0|z)= ns ^ iL(de\6,z), where n(d6>|<5, z) d = e~ h ^d9. 

G G yd J 

We build following coupling of II and n 7 . First we generate t) G A from the dis¬ 
tribution 5 i-a ns ^7 ; and we generate i?|7 ~ Ll(d0|z,?y). Hence clearly, (77, t 9 ) n. 
Given (77, t?) , we generate 1 ? as follows. If r/j = 1, we set idj = ()j. Otherwise we 
generate independently Z 3 ~ N(0,1), and set d 3 = ^7 Zj. It is also easy to check 
that (77, $) ~ fl 7 . 
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For any Lipschitz function on 0 with Lipschitz constant less of equal to 1, we have 


/(M)fi 7 (d<f,d 0 )- / me)iL(d 6 ,de) 


E f(v,&) - fir),#) 


< IE 


11^11 


< y/drf\/l- ^E(||? ? || 0 ). 


Now consider the function fo(5,9) = ^ )Tb =1 \9fi. It is Lipschitz with Lipschitz 
constant 1. Hence 


d w (n 7 ,fl) > E / o fa,0)-/oM) 


i: r)j =0 


E \ z i 

(l - ^E(Hvllo)^ , 


and the result is proved. 


□ 


Lemma 13. Assume *0 and fix S € A. For all 8 6 M rf , 


with 


h(0$\8) + ^-\\e- 8 S \\ 2 > h^9\5) > h( 8 s \ 6 ) + ^||8 - 8 S \\ 2 - r 7 (0|<S), (37) 

r 7 (tf, = f <W(0) - W(0 5 ), 0 - 0*)> + P5- V£(0) + <5 • g (6 S \5) || 2 , 
and where g(9$\S) denotes a sub-gradient of P(-\5) at 6 $. It follows in particular that 
for all 9 e M d , hy(9\8) t h{9\5), as 7 f 0 . 


Proof. From the definition we have 
hry(9\5) = mill 


ueR d 


1 


*(0) + (V£(9),u -9) + P(u\5) + ||u - 8 1| 2 

2 7 

1 


< £( 8 ) + <W(0), 9s - 8 ) + P(6 S \5) + ^-\\8 - 8 S || 2 . 

27 

By convexity of £, £(9 ) + (V£(9), 85 — 9) < £{9s), which proves the first inequality in 


(]37j) . To prove the second inequality, we start by using again the convexity of i to 

1{9)>£(9 5 ) + (V£{9s),8-9s). 


write for all 8 G 


Hence for all 9 G M d , 

m + (V£( 8 ), J^e\5) -9)> £{9 S ) + (V£(9s) - V£(9),9 - 9 S ) 

+ (\/£(9),J 1 (9\5)-9s). (38) 
By 10 P(-|<5) is convex, and if g(9s\8) denotes a sub-gradient of P(-|<5) at 9s, we have 
P(J y (0\6)\8) > P(0s\S) + (g (W, Jy(9\5) - 9 S ). (39) 
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(38)-(39) together with the expression of h n imply that 


h^{9\5) > h(Qs\5) - (W(0) - W(0 5 ), 9 - 9 S ) 

+ <W(0) + g (0*1$), -9s) + ^||o - J^(9\6)\\ 2 . 

Since J 7 (0|<5) G MJj, we can split ||0 — J 7 (0|(5)|| 2 as ||0 — 0,5|| 2 + ||05 — J 7 (0|<5)|| 2 . We use 
this in the last inequality to conclude that 

M0|<S) > H9s\8) + ^-\\9-9 5 \\ 2 -(V£(9)-V£(9s),9-9s) 

+ <V£(0) + g (0#), M9\5) - 9 S ) + ^|| M6\8) - 9 S || 2 
> h(9 s \6) + ^-||0 - 9 5 \\ 2 - <W(0) - Vl{O s ),e- 9 S ) 

— ^||<5 ■ V£(0) + 6 ■ g {9s\S) || 2 , 


as claimed. In the last inequality, the 6 appearing in front of V£(9) + g(9\5) comes 
from the fact that J 7 (0|<5) — 9s G 

It is obvious from its definition that h^(9\S) is non-decreasing as 7 | 0. If 9 ^ Mjj, 
then ||9 — 9 ■ d|| > 0, and then both extreme sides of (37) converges to +00 = h(9\5) 
as 7 | 0. If 9 G Mjj, then ||9 — 9 ■ <5|| = 0 and both extreme sides of (37) converges to 
h{9 • <5|<5) = h(9\S) as 7 4- 0. □ 


Lemma 14 . Assume Suppose that there exists 70 > 0 such that n 7o (-|z) is 
well-defined. Then for all 7 G (0,70], n 7 (-|z) is well-defined and 


d tv (n 7 ,n 7 )< 2(i- e -^)). 

Proof. For all 7 > 0, we define 

CJS)= [ e~ h ^ m d9, and C 7 = V 
jRd , 


(40) 


The term C 7 is the normalizing constant of fl 7 . The function h 7 is nondecreasing as 
7 | 0 . Hence, if C 70 < 00, then C 7 < 00 for all 7 G (0,70], which guarantees that fl 7 
is well-defined for all 7 G (0,70]. For the remaining of the proof, we fix 7 G (0,70]. 
To derive the total variation majoration, we start with a bound on C 7 . Using the 
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second inequality of (37), we write 


(2vr 7 )- d/2 C 7 = £ 


< 


E 


7T<5 


T5 


27T7 


27T7 


/ 

•/R c 

/ 


—/u 


'Wd0 


s r-r(«,0) e -£ll«-fcll 2 e -h(0«|«) d 0. 


where 


r 7 (<j, 0) = <W(0) - ve(6 s ), e - 0 S )) + 1\\6 • V£(9) + 5-g (0 5 |«S) || 2 . 

In view of this last inequality, and the definitions of ILy, and g-y, we get 


(2vr 7 ) d / 2 C 7 (2) 

C ~ - 


(41) 


The total variation bound between li 7 (d, dO\z) and 1I 7 (<5, A9\z) now follows from a 


comparison of the two measures. Indeed, Using the first inequality of (37), and for 
7 E (0,7o], we deduce that 


iMMOk) = 

c;" (277 


> 


e -h^e\s) de0 


Mill 

2 e -2yll 0 - e hl 2 e -M^I<5) d6 / 


C 


-n 7 (d, dO\z), 


(2777) 2 c 7 
> e-^ (z) n 7 (5,d0|2), 


(42) 


using (41). By a standard coupling argument (see e.g. Lindvall (1992) Equation 5.1), 
the minorization (42) implies (40). □ 


7.3. Proof of Theorem [8j It suffices to establish the stated bound on g^(z), and 
apply Theorem [6j From its definition, we have 


rf-imio 


,e-y{z) - 


2 9sll2 e h ^ s) d9 


S<5eA ( 27T7^ 


d-WSWo 

2 


e -^\\e-0s\\ 2 e -h(e s \6) dd 





















28 


YVES F. ATCHADE 


It follows from H[2]that 


r 7 (M) < (Vm-Vt{0 s ),6-6 s ) + ^\\vm-Vl(p s )t 
+ 3 l\\5.V£(ds)\\ 2 + ^\\ g ( e s \ S ^)\\ 2 

< lJi + \\e - e 5 \\ 2 + 3 7 L 2 ^9s) + yc(<5) + 3 7 L 2 P(0 5 |«5).(43) 


We set hj = 1 — 2 7 Li (1 + ), and a = 3L 2 . Then (43) gives 


/ 

Jr c 


e rs(^d) e -^\\e-e s \\ 2 < e ^-c(S) 

x f e _ ^ 7 ll 6 ' _e '' 5 ll 2 e -( 1 -7a)^)-(l-7«)-P(^|5)^ > ( 44 ) 

J M d 


Notice that the integral on the right-side of (44) can be factorized as the product 
of two integrals, with one integral taken over the components for which 5j = 0, and 
the other taken over the components for which 5j = 1. We introduce some notation 
to do this rigorously. Fix S E A, and s = ||<5||o- For a given function / : —> M, we 

define /M : I s -> M as /M(u) = f(u s ), where u 5 € M d , and uf = 0 if Si = 0, and 
vA = u^j x if <5 ? - = 1. With this notation, and for 4 7 Li < 1 (which implies that 

Z^fc= 1 °k J 


hry > 0), the integral on the right-hand side of (44) is equal to 


27r 7 

kry 


J 

JR* 


2 -(l-7a)£W(u)-(l-7a)p[ s ](u|5) c | ti _ 


/ 

JR C 


A similar calculation on the denominator of e ei ^ gives 

3 -^ll*-^lVMW d0 = (27r7) ^ [ e -e [s] M- pls] (u\5) dU ' 

Jr s 

We conclude that 

E 5s a tt S e^ c{6) (^) ^ f R , 


oQ~t( z ) 


< 


«> du 


(45) 


For 4 7 Li < 1, and using the inequality log(l — 2x — 3x 2 ) > — 6x, valid for all 
x G [0,1/4], we have 


d—s 

1 \ 2 


= exp 


d — s 


log (l - 2 7 Li - 3 7 2 L 2 ) 


< pSd'yLi' 


(46) 
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Fix uo E M s , arbitrary. Since 7 > 0 is taken such that 47 L 2 < 1, we see that 
7 a = 37 L 2 <3/4. Then by the convexity of fW we have 


(1 — 7 a)fl s \u) = —7a^^(«o) + (1 — 7 a)&\u) + ^o1^ s \uq) 

> — 70^ («o) + ^ (70^0 + (1 — 7a)u) . 

Similarly, by the convexity of pW(-|< 5 ), 

(1 — ^a)P^(u\5) > — 7 aP^(«o|< 5 ) + P^ ( 70^0 + (1 — 7 a)«|< 5 ). 

Using these last two inequalities, and the change of variable (1 — 7 a)u + r yauo = w, 
we conclude that 


J R s 

< e 7a(d s l(«o)+P W («o|5)) _ 7Q )-« f e -d s l(«)-PW( U |5) du _ 

Setting P/z) = f max< 5 e A inf ue Rs + pW(u|<5)], and using the inequality log(l — 

3x ) > —6x, x E [0,1/4] we obtain, 


/ 


3 -(l-7ai)£(u)-(l-7a 2 )P(u) ( j u < e iaK(z) e 6d-yL 2 / 


It follows from this last inequality, (|46j) and (45) that 

37 . 


£>7(2) < —- maxc(<5) + 3^d(Li + 2L 2 ) + 37 L 2 TZ{z), 
2 


as claimed. 


□ 


7.4. Proof of Corollary [ 9 j We show that mm hold and apply Theorem [ 8 j 
The function t is clearly convex and Vf(0) = — ^X'{z — X9). Hence 43 1 > holds. 
The elastic-net density in (22) is log-concave and continuous, which implies that 
P(-\5) is convex and lower semi-continuous for any given 5. Furthermore, For 9 E M^, 
sign(0) is a subgradient of x 1-7 ||x||i at 6. Hence g{9\5) c = ^-sign(0) + is a 

subgradient of P(-|<5) at 9 E Hence beholds. 

From the expression of V£, we have 


||Vf(0 2 ) — V£(0i)|| < Li\\9 — 0 2 ||. 

with L\ = f A max (X / X)/a 2 . Furthermore, for all 5 E A and 9 E M^, 

\\s • Vf(0)|| 2 = 2 .{z - X9)'XsX' s (z - X9) < IL^Jz - X9\\ 2 . 
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From the expression of ^r(• |<5), we have 


< 


aAi 


a- 


lo + 


2(1 ~ a )\ 2 


<j- 


a^ll^lli + (1 - <*)^l 


= c W + 2(1 f X2 pm, 


G 


< c(<5) + 2LiP(#|<5), 

where c(J) ||<5||o, and using the assumption (1 — a)A 2 < A max (AMY). These 

inequalities show that 1^] holds. The corollary then follows from Theorem [8] by noting 
that maxj c(5) < d, and IZ(z) < £(0) = I 

□ 


10 


Since e = f Z—X9+ ~ N(0, cr' 2 I n ), by standard Gaussian 


7.5. Proof of Theorem 

tail bound and union bound inequalities, we have 

Af 


P [Z 0 £] = P 


max I I > „ 

l<fc<a z 


< 2 exp log(d) — 


A? 


8na 2 n(l) J d ’ 
given the choice Ai = 4cry / u(l)nlog(<i). By Jensen’s inequality, 


(47) 


E [g-y(Z)\Z G £\ < logE 


e e-t( z )\Z e £ 


< — log (l — ^ j + logE e Bl ^ z ^ls{Z) , on Z e £. 


In the particular case of the linear model, we have 


0 Q^{Z) _ 


\Z-X6 s \\ a e -%\\0 t \\i 


d9 


f \ \ H d H0 n _1 

S<5 775 V2a^J JrII^Ho e 2,7 


HZ-AHI 2 -AiyuHj 


e 


Mu 


Using Lemma 17 of Atchade (2015), the denominator of satisfies the lower 

bound 


-K\\z-xe *\\ 2 Ai ye || 

ei* 7 " " e?" " 


^ , A, n 


-A-II^-AHI 2 -AflMlij 

e 2 a 211 0 " e ct 2M ""du 


> T5* 1 + 


na 2 K(s i ,) 


(48) 
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As in the proof of Theorem 


and setting M 7 = f ^X'X ^ + ^X'X^j , we have 


3 71 


rs(6,0) < (vm-v£(9s),e-0 s ) + ^-\\vi(d)-ve(9s)\\ 2 

+^\\ 8 . Xl { 9 5 )\\ 2 + ^\\ g { 9 s \ 5 A )\\ 2 

< (9 — 9s)'M^(9 — 0s) 

+^\ max (XsX' 5 )\\Z - X6 s f + ^ (^j ||<5||o. 

Set as = f [2> / a 2 )\ max {XsX' 5 ), and H 1 = f I — 27 M 7 . Using the last inequality, we get 
the bound 

f e M0,tfj e -^iF-^ir e -^ii^-^ir e -^ii^i!i d 0 

Jr 


rXm p -^\\ 0 - d d 2 e -^\\z-X0 s f e -^\\04 1( 

^(^) 2 Pllo f e -£(0-0tyHi(0-e s ) e -±ffi\\z-xes\\ a -£\\0 s \\ ld ^ ( 49 ) 

J E d 


< e 


And if we call Js{Z) the integral on the right-side of (49), then by Fubini’s theorem, 


E 


1 £ (Z)e^ llZ ~ Xe * 112 e^ lMl J 5 (Z) 


f e 2 V 

J K d 


-i-(e-6 s yHxe-o s ) e -% 


x E 


l £ (Z)exp^\\Z-X94 2 -^^\\Z-X0 s \\ 2 ) d0. (50) 


We write 


1 

2U 2 


\Z - X9 S || 2 =2^|| Z- X9 *|| 2 + ^X'(Z - X0*), 0 5 - 9^ 


+ ^(e 5 -6 ir y(X , X)(9s-9*), 


and for Z E £, | (^2 ^(Z — X9 ir ), 9s — #*) | < (Ai/2<t 2 )||^ — 0*||i. Therefore the 
expectation on the right-side of (50) is upper bounded by 


e2<r 


h\\ d s-0*\\i e - L £S s -Vt-9*y(.X'X)(0s-6 i ,) E 


^ IIZ-X 6 UH 2 

g2o-2M *11 


= ett\\ 0 i- 0 *h e - l ^ s -i 0 s- 0 *Y( x,x )( 0 s- 0 *) ( _L 


1 - 7 as 


n/2 


Letting 


m = ~ Ml - ll^lli] + ^\\9 - 9 , 111 - l -^^{9 - e*)\X'X){9 - 9 *), 


2d 2 
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for 9 £ K d , we conclude that 


E 


1 £ (Z)e^ llZ ~ Xe * 112 e^ ll9 * h J S (Z) 


< 


1 


1 - 7 a 5 


n/2 


e ~^ Y ( 9 ~ s s)'H^(9-e s ) e B(e s ) d g ^ 5 -^ 


Using an argument that can be found in Castillo et al. (20151 (proof of Theorem 10) 


and also in Atchade (2015) (proof of Lemma 5), it can be shown that the function B 
satisfies 


B{9)< ^%-^||0-0*||i, 6eR d . 


(52) 


Combining (f52[) , (|5T|), (|49]) , and (|48|), we conclude that 

ia^K \ 7Tse 2 77 


E 


ls(Z)e 


Q-y(Z) 


< — I 1 + 
tiy 


na 2 R(s*) \ s * ^ _ _ ¥( ^, || d ||o e _ « log(1 _ 7<li) 


A? 


Ai 


llo / ! X 


e -J- {e -9 s yH^e-e s ) e -^\\e s -e4i d6 . (53) 


2a 2 ) \2 tt / 

Since ( 7 /a 2 )X max (X'X) < 1/4, and using the inequality log(l — 2x — 3x 2 ) > — 6x, 


valid for all i£ [0,1/4], it can be shown that the integral on the right-side of (53) is 
upper bound by 

8ct 2x l|S|l ° 


=71 ( 2 ^) !t P k ( .S T '> A ’'- v » 

Ai 


We conclude that 


E 


,‘^h £ (z) 1 < - (1 + 5 ^/ 4 )" e sfe e & T '« A ' A ) 

1 7T$* V A 1 J 

^7r 5 e^(^) 2|l ' 5||o e-t 1 °g( 1 -7^) e iog(4)||5||o_ 


Since — § log(l — 7 as) < 37 nA max (A'LY)/< 7 2 , and with Ai = Aay/K(l)n log(d), 
e ^( Z h £ (Z) < - log(7T 5 J + S* log (1 -f- K '( S *) .. ^ 11 


logE 


+ 128 


log (d) 


161og(d) J 

+ ^ (nA max (A'A) + Tr(A'A)) +log^7r (5 e^(^) Pl|o e log ( 4 )ll 5 llo. 
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Set A d = log(4) + ^ Then 

io g y>^ii*"° = 

<5 

< 

for 1) < u — 1. Also, 

2 

- log(7T 5 J < US*log(d) + 

The theorem is proved. 

□ 


\og^( d )q s (l-q) d - s e As 

s =0 ' V ' S ' 

dlog(l + qe A (l-e _A )) , 
/A \ 2 

4 log(4) + 67 ( , 
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